[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package #37874
Conversation
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Code Review
This pull request is an excellent refactoring that consolidates the LRUOffloadingManager and ARCOffloadingManager into a single CPUOffloadingManager using a strategy pattern with pluggable CachePolicy implementations. This significantly reduces code duplication and improves modularity. A key improvement is making the eviction logic atomic, preventing partial evictions on failure. The implementation of the strategy pattern, including the CachePolicy ABC and the policy registry, is clean and extensible. The changes are well-tested. I have one suggestion to improve the robustness of the LRUCachePolicy.
vllm/v1/kv_offload/cpu_manager.py
Outdated
```python
self.blocks[block_hash] = block

def remove(self, block_hash: BlockHash) -> None:
    del self.blocks[block_hash]
```
The remove method in LRUCachePolicy uses del self.blocks[block_hash], which will raise a KeyError if the block hash is not present. While the current call site in CPUOffloadingManager.complete_store ensures the key exists before calling remove, making this method more robust would be beneficial for future maintenance and to prevent potential crashes if the calling logic changes. The ARCCachePolicy implementation already uses a safer pop method. Consider using self.blocks.pop(block_hash, None) to silently handle cases where the block is not in the cache, which improves defensiveness.
```diff
-    del self.blocks[block_hash]
+    self.blocks.pop(block_hash, None)
```
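The behavioral difference the reviewer is pointing at can be shown in a standalone snippet; the `blocks` dict here merely stands in for the policy's internal mapping and is not the actual vLLM code:

```python
# Illustrative only: `blocks` stands in for LRUCachePolicy's internal dict.
blocks = {"hash-a": "block-a"}

# `del` raises KeyError when the key is absent:
try:
    del blocks["hash-missing"]
except KeyError:
    print("del raised KeyError")

# `pop` with a default is silent for absent keys:
blocks.pop("hash-missing", None)  # no exception
blocks.pop("hash-a", None)        # removes the entry
print(blocks)  # -> {}
```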
vllm/v1/kv_offload/cpu.py
Outdated
```python
)
self._manager = CPUOffloadingManager(
    backend=backend,
    cache_policy=cast(Literal["lru", "arc"], self.eviction_policy),
```
IMO since `self.eviction_policy` comes from user config (line 59), wouldn't it be cleaner to pass the raw string with `# type: ignore[arg-type]` and let `CPUOffloadingManager.__init__` handle the validation? It already raises `ValueError` for unknown policies, and `cast()` silences mypy without validating anything at runtime.
Thanks, good point. I'll replace `cast()` with `# type: ignore[arg-type]` and rely on the runtime validation in `CPUOffloadingManager.__init__`.
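A minimal sketch of the agreed-upon pattern: the constructor validates against the registry at runtime, so the call site can pass the raw config string. The registry name `_CACHE_POLICIES` follows the PR description, but the class bodies and signatures here are simplified assumptions:

```python
# Sketch only: registry-backed runtime validation for the policy name.
class LRUCachePolicy: ...
class ARCCachePolicy: ...

_CACHE_POLICIES = {"lru": LRUCachePolicy, "arc": ARCCachePolicy}

class CPUOffloadingManager:
    def __init__(self, cache_policy: str) -> None:
        # Validate here, once, instead of cast()-ing at every call site.
        if cache_policy not in _CACHE_POLICIES:
            raise ValueError(
                f"Unknown cache policy {cache_policy!r}; "
                f"expected one of {sorted(_CACHE_POLICIES)}"
            )
        self.cache_policy = _CACHE_POLICIES[cache_policy]()

# At the call site the raw user-config string passes through; mypy's
# Literal complaint is silenced while validation stays at runtime:
#   CPUOffloadingManager(cache_policy=self.eviction_policy)  # type: ignore[arg-type]
```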
orozery
left a comment
I would suggest the following file structure:
vllm/v1/kv_offload/cpu/manager.py
vllm/v1/kv_offload/cpu/spec.py
vllm/v1/kv_offload/cpu/policies/abstract.py
vllm/v1/kv_offload/cpu/policies/lru.py
vllm/v1/kv_offload/cpu/policies/arc.py
@ronensc WDYT?
vllm/v1/kv_offload/cpu_manager.py
Outdated
```python
def __init__(
    self,
    backend: Backend,
```
We should remove the Backend abstraction and merge the code of CPUBackend inside CPUOffloadingManager.
Thanks, this makes sense. I'll merge CPUBackend into CPUOffloadingManager.
I also like the proposed file structure. A couple of questions:
- Should the `worker/` directory also be moved under `cpu/`?
- Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?
> Should the `worker/` directory also be moved under `cpu/`?
We need to decide if we want to split directories based on worker/scheduler.
Let's think about that later.
> Do you prefer handling the file structure reorganization in this PR, or as a follow-up PR?
Let's do this here.
Done.
@orozery ready for a 2nd review round.
Updated PR title and description
Purpose
Refactor the CPU KV-cache offloading subsystem to reduce duplication, remove
unnecessary abstraction layers, and improve code organization.
Consolidate LRU/ARC managers using a strategy pattern

`arc_manager.py` and `lru_manager.py` duplicated ~40 lines of identical skeleton code: `take_events`, event emission, ref-count management in `prepare_load`/`complete_load`, `backend.allocate_blocks()`, and `__init__` boilerplate. These are now unified in `CPUOffloadingManager`, with policy-specific logic isolated in `CachePolicy` implementations.

Remove the `Backend` abstraction

The `Backend` ABC and `CPUBackend` class added a layer of indirection with no polymorphism benefit. The block pool logic is now inlined directly into `CPUOffloadingManager` as private methods.

Restructure into a `cpu/` package

Per reviewer suggestion, the flat files are split into a proper subdirectory with one responsibility per file:

- `vllm/v1/kv_offload/cpu/manager.py`
- `vllm/v1/kv_offload/cpu/spec.py`
- `vllm/v1/kv_offload/cpu/policies/abstract.py`
- `vllm/v1/kv_offload/cpu/policies/lru.py`
- `vllm/v1/kv_offload/cpu/policies/arc.py`
Key design decisions

- The `CachePolicy` abstract base covers both block organization and replacement decisions; LRU and ARC differ in both, so they cannot be cleanly split.
- Policies live in a `_CACHE_POLICIES` registry dict, so adding a new policy requires no changes to `CPUOffloadingManager`.
- `evict()` implementations are now atomic: candidates are collected without mutating state, and changes only apply if all `n` evictions can be satisfied (the original ARC code could partially evict before returning `None`).
- `LRUCachePolicy.touch()` fixed: `if self.blocks.get(hash):` → `if hash in self.blocks:` (the old check was unreliable for blocks with `ref_cnt == 0`).

Files deleted: `arc_manager.py`, `lru_manager.py`, `cpu.py`, `cpu_manager.py`, `backend.py`, `backends/cpu.py`

Test Plan
Test Result
cc @orozery @albertoperdomo2